Syntactic Folding and its Application to the Information Extraction from Web Pages

Author

  • Jörg Herrmann
Abstract

The paper deals with investigations concerning potential structures of documents that will be subject to automated information extraction. The focus is on folding principles and their influence on the recognition of certain data in a document undergoing the extraction.

Introduction

The topic of our work is information extraction from the Internet. There are a couple of approaches which deal with the problem of recognizing structural data in semistructured documents for the retrieval of user-specified information from these and from similar documents (possibly of the same source) in an automatic or semi-automatic way (Freitag 1996), (Soderland 1997), (Kushmerick 1997). Ideally, structural information shall be learned by presenting to a learning device only samples of the text segments which a user wants to extract from these pages, without any need to specify how the desired samples can be localized within the document. The learning device should generate a procedure, a wrapper, that, reading the same documents, puts out a collection of information including the samples and, hopefully, extending them in terms of finding similar items. These approaches led to a variety of wrapper classes, e.g. LR-wrappers (Kushmerick 2000), island wrappers (Grieser et al. 2000), T-wrappers (Thomas 1999), and further variants of them with different characteristics, for instance (Hsu & Dung 1998), (Muslea, Minton, & Knoblock 1999). Extracting the fundamentals of these approaches, our goal consists in a specification of the area of application for a wrapper under investigation, and in finding rules for their conscious selection and generation. On the basis of (Grieser et al. 2000) and a comparison with Kushmerick's wrapper classes, this publication focuses on the question of how tuples that shall be extracted from a document in the manner above can interrelate. It is not our intention to set the mentioned wrapper concepts into relation in terms of investigating the acceptable languages in detail. To restrict the domain, we make a couple of assumptions: the information we are looking for should be encapsulated by syntactic expressions that are comparable in some manner. We suppose there are a lot of web pages which are organized in a similar way and which make use of the same expressions for structuring information. However, we do not require a specific set of such expressions, so the approach is not restricted to HTML documents; HTML serves more as an application domain.
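As a rough illustration of the LR-wrapper idea referenced above (Kushmerick 2000), the following Python sketch extracts tuples from a page by locating each attribute between a pair of left and right delimiter strings. The function name, the delimiter pairs, and the sample page fragment are illustrative assumptions and are not taken from the paper.

def lr_extract(page, delimiters):
    """Extract tuples from a page string using an LR-style wrapper.

    `delimiters` is a list of (left, right) string pairs, one pair per
    tuple attribute, in the order the attributes occur on the page.
    """
    tuples = []
    pos = 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return tuples  # no further occurrence of the left delimiter: stop
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return tuples  # unmatched right delimiter: stop
            record.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(record))

# Hypothetical page fragment with repeated markup around the wanted items.
page = "<b>Congo</b> <i>242</i> <b>Egypt</b> <i>20</i>"
print(lr_extract(page, [("<b>", "</b>"), ("<i>", "</i>")]))
# [('Congo', '242'), ('Egypt', '20')]

The sketch assumes that every tuple uses exactly the same surrounding expressions, which is the kind of structural regularity the paper's assumptions about comparable syntactic expressions refer to.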


Similar resources

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...


Analyzing new features of infected web content in detection of malicious web pages

Recent improvements in web standards and technologies enable attackers to hide and obfuscate infectious code with new methods and thus escape security filters. In this paper, we study the application of machine learning techniques in detecting malicious web pages. In order to detect malicious web pages, we propose and analyze a novel set of features including HTML, JavaScript (jQuery...


Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. In our extraction algorithm, we are inspired by the way a human user scans the page content for specific data. In particular, we use text fea...


Investigating the relationship between information quality and external indicators in Persian web pages related to public health

Introduction: One approach to evaluate the quality of a web page is to investigate its external markers. The purpose of the present study is to determine the relationship between information quality of Persian public health web pages and their external quality. Methods: The samples of this correlation study were selected from among the freely available ten-key word texts of chronic diseases...


Prehospital Trauma Management: Evaluation of a New Designed Smartphone Application

Background: Regarding the undeniable role of nurses in caring for patients with trauma, a suitable method is required for the education of nursing students to meet their needs. In this study, first we designed an educational smartphone application for pre-hospital trauma management and then nursing students evaluated its usefulness.  Methods: This applied study has a cross-sectional design and...



Publication date: 2001